Notes for me:

  • Disable "always show toolbar" before starting
  • Set zoom to 150%
  • Check that there is an internet connection















Machine Learning with Python and Scikit Learn

Follow presentation on: http://bit.ly/ML-SpringCampus

















What is Machine Learning?

The process of teaching computers to learn from data.

Learning tasks:

  • Clustering

  • Regression

  • Outlier Detection

  • Classification

  • Time series prediction

  • ....













Supervised vs Unsupervised Learning





















Let's look at some code:

We will use:


In [ ]:
import warnings

import numpy as np
import pandas as pd

from time import time

import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

import Utils
from Utils import cmap_light
from Utils import cmap_bold









Boston House Prices


In [ ]:
# Note: load_boston is deprecated in newer scikit-learn releases
# (removed in 1.2); this notebook assumes an older version.
boston_dataset = datasets.load_boston()
print(boston_dataset.DESCR)
















The data

Dataset:

A set of examples, each characterized by features; usually there is also a label variable that you want to predict.
















Let's see a real example:


In [ ]:
X = boston_dataset.data
Y = boston_dataset.target

names = list(boston_dataset.feature_names) + ['Price']

labels = np.reshape(Y,
                     (Y.shape[0], 1))
df = pd.DataFrame(data=np.concatenate((X, labels), axis=1),
                 columns=names)
df.head(10)

In [ ]:
df_tmp = df[['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX',
             'RM', 'Price']]
df_tmp.head(10)











In [ ]:
df_tmp.describe()











In [ ]:
from Utils import plot_boston_dataset
plot_boston_dataset(boston_dataset.data, 
                    boston_dataset.target)



















First model: Linear Regression

The 95% confidence interval suggests Rexthor's dog could also be a cat, or possibly a teapot.





In [ ]:
model = LinearRegression()

model.fit(X, Y)

r2 = model.score(X, Y)

print("R^2 value: {:0.3f}".format(r2))










Congratulations!











In [ ]:
example_n = np.random.randint(0, Y.shape[0])

Utils.describe_example_boston_dataset(X[example_n])

print("\n\nPredicted price: {:2.2f} Real value: {:2.2f}".format(
        model.predict(X[example_n].reshape(1, -1))[0], Y[example_n]))










Distinguishing Species of Iris plants:



Source: Big Cypress National Preserve











In [ ]:
iris_dataset = datasets.load_iris()

print("Features: " + str(iris_dataset.feature_names))
print("Classes: " + str(iris_dataset.target_names))

X = iris_dataset.data
y = iris_dataset.target

In [ ]:
# Load it to a DF
idx = np.random.permutation(150)
y = y[idx]
X = X[idx]

labels = np.reshape(y,
                    (y.shape[0], 1))
df = pd.DataFrame(data=np.concatenate((X, labels), axis=1),
                 columns=iris_dataset.feature_names + ['Class'])
df.head(10)

In [ ]:
df.describe()

In [ ]:
# Let's take a peek at the data:
plt.figure(figsize=(8,8))
colors = "bry"
for i, color in zip([0, 1, 2], colors):
    idx = np.where(y == i)
    plt.scatter(X[idx, 0], X[idx, 1], c=color, cmap=plt.cm.Paired)
    
plt.text(5.25, 2.20, "Versicolor", fontsize=14)
plt.text(7, 3.5, "Virginica", fontsize=14)
plt.text(4.5, 3.75, "Setosa", fontsize=14)

plt.title("The 3 different Iris species", fontsize=18, 
          fontweight='bold')    
plt.xlabel(iris_dataset.feature_names[0], fontsize=14)
plt.ylabel(iris_dataset.feature_names[1], fontsize=14)

plt.show()

In [ ]:
# We will focus on identifying only the Iris Setosa
plt.figure(figsize=(8,8))
colors = "br"

idx = np.where(y == 0) # Give me the indices of the Iris Setosa examples

plt.scatter(X[idx, 0], X[idx, 1], c='b', cmap=plt.cm.Paired)
plt.text(4.5, 3.75, "Setosa", fontsize=14)

idx = np.where(y != 0) # where it's not Iris Setosa 
plt.scatter(X[idx, 0], X[idx, 1], c='r', cmap=plt.cm.Paired)
plt.text(7.0, 2.5, "Others", fontsize=14)


plt.title("Scatter plot of Iris Setosa and the others Iris",
          fontsize=18, fontweight='bold')  
plt.xlabel(iris_dataset.feature_names[0], fontsize=14)
plt.ylabel(iris_dataset.feature_names[1], fontsize=14)
plt.show()










Second model: Logistic Regression




In [ ]:
# We only care about whether each flower is an
#     Iris Setosa, and we look at only two of its features

X = iris_dataset.data
y = iris_dataset.target

new_y = y == 0

model = LogisticRegression(random_state=42, verbose=0)

model.fit(X[:,0:2], new_y)

accuracy = model.score(X[:,0:2], new_y)

print("Accuracy: {:0.3f}%".format(accuracy*100))

In [ ]:
from Utils import predict_mesh

# Let's take a look at what our model is doing

# First plot the examples
plt.figure(figsize=(8,8))
colors = "br"

idx = np.where(y == 0)
plt.scatter(X[idx, 0], X[idx, 1], c='b', cmap=plt.cm.Paired)
plt.text(4.5, 3.75, "Setosa", fontsize=14)

idx = np.where(y != 0)
plt.scatter(X[idx, 0], X[idx, 1], c='r', cmap=plt.cm.Paired)
plt.text(7.0, 2.5, "Others", fontsize=14)

(xx, yy, Z) = predict_mesh(X, model)
plt.contour(xx, yy, Z, cmap=plt.cm.Paired)


plt.title("Decision Boundary", fontsize=18, fontweight='bold')   
plt.xlabel(iris_dataset.feature_names[0], fontsize=14)
plt.ylabel(iris_dataset.feature_names[1], fontsize=14)
plt.show()










Linear Regression and Logistic Regression

So how do these models work?

Let's start with linear regression:

$$ \hat{y} = w_0 + w_1 \cdot x_1 + w_2 \cdot x_2 + w_3 \cdot x_3 $$

Adding an $x_0 = 1$, we get

$$ \hat{y} = w^T \cdot x $$

For each variable we have a weight, an "importance", and the linear combination of the weights and features results in our estimated value $\hat{y}$.
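
To make the formula concrete, here is a small sketch (not in the original notebook): it re-fits a LinearRegression on the Boston data and checks that $w_0 + w^T \cdot x$, built from the fitted coef_ and intercept_ attributes, reproduces predict(). The names lin_model and x_example are introduced only for this illustration.


In [ ]:
# Sketch: the weights live in .coef_ and the bias w_0 in .intercept_,
# so the linear combination should match model.predict exactly.
lin_model = LinearRegression()
lin_model.fit(boston_dataset.data, boston_dataset.target)

x_example = boston_dataset.data[0]
manual = lin_model.intercept_ + np.dot(lin_model.coef_, x_example)

print("Manual w_0 + w^T.x : {:2.2f}".format(manual))
print("predict()          : {:2.2f}".format(
    lin_model.predict(x_example.reshape(1, -1))[0]))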













What about the weights?
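
One way to look at them (a sketch, not part of the original slides): the fitted estimator exposes the weights as coef_ and the bias as intercept_, and pairing them with the feature names gives a rough, scale-dependent sense of each feature's "importance". lin_model is re-fit here so the cell stands on its own.


In [ ]:
# Sketch: print each learned weight next to its feature name.
lin_model = LinearRegression().fit(boston_dataset.data, boston_dataset.target)

for name, weight in zip(boston_dataset.feature_names, lin_model.coef_):
    print("{:>10}: {: .3f}".format(name, weight))
print("{:>10}: {: .3f}".format("intercept", lin_model.intercept_))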















Questions?















Logistic Regression

Same model + Classifier function

Model:

$$ \hat{y} = w^T \cdot x $$



Model + classification:

$$ \hat{y} = g(w^T \cdot x) $$
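
As a quick check (a sketch, not in the original notebook): for the logistic regression fit above on the first two iris features, applying the sigmoid $g$ to the linear score from decision_function reproduces the probability that predict_proba assigns to the positive (Setosa) class.


In [ ]:
# Sketch: g(w^T . x) equals the predicted probability of the positive class.
scores = model.decision_function(X[:5, 0:2])   # the linear part, w^T . x + w_0
probs = 1 / (1 + np.exp(-scores))              # g(z), the sigmoid

print("sigmoid of the score :", np.round(probs, 3))
print("predict_proba        :", np.round(model.predict_proba(X[:5, 0:2])[:, 1], 3))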














Sigmoid function

$$g(z) = \frac{1}{1 + e^{-z}}$$






In [ ]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.arange(-20, 20, 0.001)
y = sigmoid(x)

plt.figure(figsize=(10,5))
plt.plot(x, y)
plt.title("Sigmoid Function", fontsize=14)
plt.show()










Now on to a real-world example!










First look at the data

Data from: Whisky Classified


In [ ]:
# Read the data file and drop the columns we don't care about:
whisky_dataframe = pd.read_csv(
    filepath_or_buffer="whiskies.csv", header=0, sep=',',
    index_col=1)
whisky_dataframe.drop(['RowID', 'Postcode', ' Latitude',
                       ' Longitude'], inplace=True, axis=1)
whisky_dataframe.head(10)











In [ ]:
whisky_dataframe.describe()











In [ ]:
Utils.plot_whisky_histograms(whisky_dataframe)

In [ ]:
Utils.plot_whiky_body_correlation(whisky_dataframe)










Let's take a detour




















The Curse of Dimensionality

More features != Better Results

When the dimensionality increases, the volume of the space increases so fast that the available data become sparse.
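
A small numeric sketch (not in the original notebook) of the same idea: keep the number of points fixed and watch the average distance to the nearest neighbour grow as dimensions are added, i.e. the data becomes sparser and sparser.


In [ ]:
# Sketch: 100 uniform random points; mean nearest-neighbour distance per dimensionality.
from sklearn.neighbors import NearestNeighbors

n_points = 100
for dim in [1, 2, 5, 10, 50]:
    points = np.random.rand(n_points, dim)
    dist, _ = NearestNeighbors(n_neighbors=2).fit(points).kneighbors(points)
    # column 0 is the point itself (distance 0), column 1 its nearest neighbour
    print("{:>3d} dimensions: mean nearest-neighbour distance = {:.3f}".format(
        dim, dist[:, 1].mean()))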




















In [ ]:
Utils.plot_1d_random_data(0.5, 30)

In [ ]:
Utils.plot_2d_random_data(0.5, 30)










Feature selection and extraction



















PCA


In [ ]:
# Two correlated blobs of random data (numpy may warn that the
# "covariance" below is not symmetric; it is only used to stretch the data)
random_data_1 = np.random.multivariate_normal(
    mean=[0, 0], cov=[[5, 5], [0, 0.5]], size=100)

random_data_2 = np.random.multivariate_normal(
    mean=[6, 6], cov=[[5, 5], [0, 0.5]], size=100)

random_data = np.concatenate([random_data_1, random_data_2], axis=0)
random_labels = np.concatenate([np.ones(100), np.zeros(100)])  # 1-D array of colour labels

fig = plt.figure(figsize=(8, 8))

plt.scatter(random_data[:, 0], random_data[:, 1], c=random_labels, cmap=cmap_light)
#plt.scatter(random_data_2[:, 0], random_data_2[:, 1], c='r')

plt.plot([-5, 10], [-5, 10], 'r--')
plt.plot([5, 0], [0, 5], 'g--')

plt.xlim((-7, 14))
plt.ylim((-7, 14))
plt.title('Random Data with Principal Components', fontsize=16)

plt.xlabel('Random Dimension 1', fontsize=14)
plt.ylabel('Random Dimension 2', fontsize=14)

plt.show()

In [ ]:
pca = PCA(n_components=2)
transformed_data = pca.fit_transform(random_data)  # fit and project in one step

plt.figure(figsize=(8,6))
plt.scatter(transformed_data[:,0], transformed_data[:,1],
            c=random_labels, cmap=cmap_light)
plt.plot([-10, 10], [0, 0], 'r--')
plt.xlim((-10, 10))
plt.ylim((-5, 5))
plt.title('Transformed Random Data', fontsize=16)
plt.xlabel('Random Dimension 1', fontsize=14)
plt.ylabel('Random Dimension 2', fontsize=14)

plt.show()

In [ ]:
pca = PCA(n_components=1)
transformed_data = pca.fit_transform(random_data)

plt.figure(figsize=(8,5))
plt.scatter(transformed_data[:,0], np.zeros((200,1)),
            c=random_labels, cmap=cmap_light)
plt.plot([-10, 10], [0, 0], 'r--')
plt.xlim((-10, 10))
plt.ylim((-5, 5))
plt.title('Transformed Random Data', fontsize=16)
plt.xlabel('Random Dimension 1', fontsize=14)

plt.show()

print("% of variance explained by PCA: {:0.1f}% \
        ".format(
        pca.explained_variance_ratio_[0]*100))


















Model complexity and overfitting



















In [ ]:
# Adapted from: http://scikit-learn.org/stable/auto_examples/linear_model/plot_polynomial_interpolation.html 
# Author: Mathieu Blondel
#         Jake Vanderplas
# License: BSD 3 clause

def f(x, noise=False):
    """ function to approximate by polynomial interpolation"""
    if(noise):
        return np.sin(x) + np.random.randn(x.shape[0])/4
    return np.sin(x)

space_size = 6

# generate points used to plot
x_plot = np.linspace(-space_size, space_size, 100)

# generate points and keep a subset of them
x = np.linspace(-space_size, space_size, 100)
rng = np.random.RandomState(42)
rng.shuffle(x)
x = np.sort(x[:10])
y = f(x, True)

# create matrix versions of these arrays
X = x[:, np.newaxis]
X_plot = x_plot[:, np.newaxis]

colors = ['teal', 'yellowgreen', 'gold', 'blue']
lw = 2

fig = plt.figure(figsize=(12,12))
    

for count, degree in enumerate([1, 3, 6, 10]):
    ax = fig.add_subplot(2, 2, count+1)
    ax.plot(x_plot, f(x_plot), color='cornflowerblue', linewidth=lw,
         label="ground truth")
    ax.scatter(x, y, color='navy', s=30, marker='o', label="training points")
    model = make_pipeline(PolynomialFeatures(degree), Ridge())
    model.fit(X, y)
    y_plot = model.predict(X_plot)
    ax.plot(x_plot, y_plot, color=colors[count], linewidth=lw,
             label="degree %d" % degree)

    ax.legend(loc='lower left')
    ax.set_ylim((-5, 5))
plt.show()



















Back to Scotch!

Let's apply what we learned to our dataset


In [ ]:
whisky_data = whisky_dataframe.values

pca = PCA(n_components=2, whiten=True) 

# whiten=True rescales each principal component to unit variance;
# PCA always centers the data itself before projecting
transformed_data = pca.fit_transform(whisky_data)

In [ ]:
print("% of variance explained by each component: \
       \n 1st {:0.1f}% \
       \n 2nd {:0.1f}% \
        ".format(
        pca.explained_variance_ratio_[0]*100, 
        pca.explained_variance_ratio_[1]*100))

In [ ]:
fig = plt.figure(figsize=(8,6))
plt.scatter(x = transformed_data[:,0], y=transformed_data[:,1])

plt.xlim((-3, 5))
plt.ylim((-3, 5))

plt.title('Transformed Whisky Data', fontsize=16)
plt.xlabel('Principal Component 1', fontsize=14)
plt.ylabel('Principal Component 2', fontsize=14)

plt.show()



















Predicting whether a whisky has a Tobacco taste


In [ ]:
labels = whisky_dataframe['Tobacco']
whisky_data = whisky_dataframe.drop('Tobacco', axis=1).values

In [ ]:
print("Percentage of Positive Labels: {:.2f}%".format(
        np.sum(labels)/len(labels)*100))













Unbalanced dataset


In [ ]:
pca = PCA(n_components=2, whiten=True) 
# whiten=True rescales each principal component to unit variance;
# PCA always centers the data itself before projecting
transformed_data = pca.fit_transform(whisky_data)

In [ ]:
train_data, test_data, train_labels, test_labels = train_test_split(
    transformed_data, labels, test_size=0.30, random_state=0)

# Without class weights (uncomment to compare)
# classf = LogisticRegression()

# With class weights: mistakes on the rare positive class cost 12x more
class_weight = {0: 1, 1: 12}
classf = LogisticRegression(class_weight=class_weight)

classf.fit(train_data, train_labels)

accuracy = classf.score(train_data, train_labels)

print("\n\nTraining Accuracy:\t {:0.3f}%\n\n".format(accuracy*100))

accuracy = classf.score(test_data, test_labels)

print("Test Accuracy:\t\t {:0.3f}%\n\n".format(accuracy*100))

Confusion Matrix


In [ ]:
print("\tTraining \n")
predicted_labels = classf.predict(train_data)
cm = confusion_matrix(train_labels, predicted_labels)
Utils.print_cm(cm)

print("\n\tTesting \n")

predicted_labels = classf.predict(test_data)
cm = confusion_matrix(test_labels, predicted_labels)
Utils.print_cm(cm)















Cross Validation




In [ ]:
class_weight={0:1, 1: 12}

classf = LogisticRegression(random_state=42, 
                            class_weight=class_weight)
#classf = LogisticRegression(random_state=42)

In [ ]:
# Select parameters to use in Cross-Validation
classf_cv = classf
data_cv = transformed_data
N_CV = 10

# Cross Validation
t0 = time()
scores = cross_val_score(classf_cv, data_cv, labels, cv = N_CV)
print("Scores: ")
for i, score in enumerate(scores):
    print( '\t' + str(i) + ':\t' + str(score)) 
print("Accuracy: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() * 2))
print("\nCross val done in %0.3fs." % (time() - t0))




























Unsupervised Learning

Clustering: K-means

TODO: Explain with a toy data set (a sketch follows in the cell below), add video / GIF?


In [ ]:
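# A minimal K-means sketch on a toy data set (this cell was empty in the
# original notebook; make_blobs and the choice of 3 clusters are assumptions).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

toy_data, _ = make_blobs(n_samples=300, centers=3, random_state=42)

toy_kmeans = KMeans(n_clusters=3, random_state=42)
toy_clusters = toy_kmeans.fit_predict(toy_data)

plt.figure(figsize=(8, 6))
plt.scatter(toy_data[:, 0], toy_data[:, 1], c=toy_clusters)
plt.scatter(toy_kmeans.cluster_centers_[:, 0], toy_kmeans.cluster_centers_[:, 1],
            c='r', marker='x', s=100)
plt.title('K-means on a toy data set', fontsize=16)
plt.show()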

Clustering Scotch:
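
A possible sketch (not in the original notebook): run K-means on the PCA-transformed whisky data from above. The number of clusters (4) is an arbitrary choice for illustration.


In [ ]:
# Sketch: cluster the 2-D PCA projection of the whisky data and plot the result.
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=4, random_state=42)
cluster_ids = kmeans.fit_predict(transformed_data)

plt.figure(figsize=(8, 6))
plt.scatter(transformed_data[:, 0], transformed_data[:, 1], c=cluster_ids)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='k', marker='x', s=100, label='centroids')

plt.title('K-means clusters in the whisky PCA space', fontsize=16)
plt.xlabel('Principal Component 1', fontsize=14)
plt.ylabel('Principal Component 2', fontsize=14)
plt.legend()
plt.show()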












Model Complexity and Overfitting













In [ ]:
random_data = np.random.randn(100, 2)
random_labels = np.random.randint(0,2,100)

fig = plt.figure(figsize=(8,8))

plt.scatter(random_data[:, 0], random_data[:, 1],
            c=random_labels, cmap=cmap_bold)

plt.xlabel('Random Dimension 1', fontsize=14)
plt.ylabel('Random Dimension 2', fontsize=14)

plt.show()

K-Nearest Neighbors Classifier


In [ ]:
# n_neighbors=1 memorizes the training data (overfits); uncomment to compare
# clf = KNeighborsClassifier(n_neighbors=1)
clf = KNeighborsClassifier(n_neighbors=10)

clf.fit(random_data, random_labels)

print("Accuracy: {:0.3f}%".format(
        clf.score(random_data, random_labels)*100))

In [ ]:
(xx, yy, Z) = Utils.predict_mesh(random_data, clf, h=0.01)

fig = plt.figure(figsize=(8,8))
plt.xlabel('Random Dimension 1', fontsize=14)
plt.ylabel('Random Dimension 2', fontsize=14)
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

plt.scatter(random_data[:, 0], random_data[:, 1],
            c=random_labels, cmap=cmap_bold)

plt.show()

In [ ]:


In [ ]:


In [ ]:
random_labels = np.concatenate([np.ones((50,)), np.zeros((50,))])

random_data = np.concatenate([
    np.add(np.multiply(np.random.randn(50, 2), 
                       np.array([0.7, 1.5])), np.array([3, 1])),
    np.multiply(np.random.randn(50, 2), 
                np.array([0.5, 3]))
    ]) 


fig = plt.figure(figsize=(8, 8))

plt.scatter(random_data[:, 0], random_data[:, 1], 
            c=random_labels, cmap=cmap_bold)
plt.xlim((-4, 8))
plt.ylim((-6, 6))
plt.xlabel('Random Dimension 1', fontsize=14)
plt.ylabel('Random Dimension 2', fontsize=14)

plt.show()

In [ ]:
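# A sketch (this cell was empty in the original notebook): fit the k-NN
# classifier on this more structured data and draw its decision boundary.
# Try n_neighbors=1 vs 15 to see the boundary go from jagged (overfit) to smooth.
clf = KNeighborsClassifier(n_neighbors=15)
clf.fit(random_data, random_labels)

(xx, yy, Z) = Utils.predict_mesh(random_data, clf)

fig = plt.figure(figsize=(8, 8))
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
plt.scatter(random_data[:, 0], random_data[:, 1],
            c=random_labels, cmap=cmap_bold)
plt.xlabel('Random Dimension 1', fontsize=14)
plt.ylabel('Random Dimension 2', fontsize=14)
plt.show()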

There is so much more

This barely scratches the surface. Go ahead and experiment: it's a very interesting field, and there is a ton of information and plenty of places to learn from!


In [ ]:
whisky_data = pd.read_csv(
    filepath_or_buffer="Meta-Critic Whisky Database – Selfbuilts Whisky Analysis.csv")
whisky_data.describe()
whisky_data.head()
